Less is More: Learning Prominent and Diverse Topics for Data Summarization
نویسندگان
چکیده
Statistical topic models efficiently facilitate the exploration of large-scale data sets. Many models have been developed and broadly used to summarize the semantic structure in news, science, social media, and digital humanities. However, a common and practical objective in data exploration tasks is not to enumerate all existing topics, but to quickly extract representative ones that broadly cover the content of the corpus, i.e., a few topics that serve as a good summary of the data. Most existing topic models fit exactly the same number of topics as a user specifies, which have imposed an unnecessary burden to the users who have limited prior knowledge. We instead propose new models that are able to learn fewer but more representative topics for the purpose of data summarization. We propose a reinforced random walk that allows prominent topics to absorb tokens from similar and smaller topics, thus enhances the diversity among the top topics extracted. With this reinforced random walk as a general process embedded in classical topic models, we obtain diverse topic models that are able to extract the most prominent and diverse topics from data. The inference procedures of these diverse topic models remain as simple and efficient as the classical models. Experimental results demonstrate that the diverse topic models not only discover topics that better summarize the data, but also require minimal prior knowledge of the users.
منابع مشابه
A Survey of Text Summarization Techniques
Numerous approaches for identifying important content for automatic text summarization have been developed to date. Topic representation approaches first derive an intermediate representation of the text that captures the topics discussed in the input. Based on these representations of topics, sentences in the input document are scored for importance. In contrast, in indicator representation ap...
متن کاملA survey on Automatic Text Summarization
Text summarization endeavors to produce a summary version of a text, while maintaining the original ideas. The textual content on the web, in particular, is growing at an exponential rate. The ability to decipher through such massive amount of data, in order to extract the useful information, is a major undertaking and requires an automatic mechanism to aid with the extant repository of informa...
متن کاملمقایسه روشهای مختلف یادگیری ماشین در خلاصهسازی استخراجی گفتار به گفتار فارسی بدون استفاده از رونوشت
In this paper, extractive speech summarization using different machine learning algorithms was investigated. The task of Speech summarization deals with extracting important and salient segments from speech in order to access, search, extract and browse speech files easier and in a less costly manner. In this paper, a new method for speech summarization without using automatic speech recognitio...
متن کاملTopic-based web site summarization
Purpose Summarization of an entire Web site with diverse content may lead to a summary heavily biased towards the site’s dominant topics. This paper presents a novel topic-based framework to address this problem. Design/methodology/approach A two-stage framework is proposed. The first stage identifies the main topics covered in a Web site via clustering and the second stage summarizes each topi...
متن کاملUsing Machine Learning Techniques for Subjectivity Analysis based on Lexical and Non-lexical Features
Machine learning techniques have been used to address various problems and classification of documents is one of the main applications of such techniques. Opinion mining has emerged as an active research domain due to its wide range of applications such as multi-document summarization, opinion mining of documents and users’ reviews analysis improving answers of opinion questions in forums. Exis...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
- CoRR
دوره abs/1611.09921 شماره
صفحات -
تاریخ انتشار 2016